The goal of this project is to investigate a data set and attempt to answer predefined questions. The project is part of the course ‘Verarbeitung und Speicherung großer Datenmengen’ in the winter semester 2023 at the Fachhochschule Wiener Neustadt.
The preset structure of the project is given as follows:
Initially we need to load the necessary packages. (Chunk option: results=‘hide’)
# Define the packages to install
packages <- c("ggplot2", "reshape2", "hexbin", "tidyverse", "modelr", "MASS", "dplyr", "ggraph", "gridExtra", "glmnet")
# Loop over the packages and install them if they are not already installed
for (package in packages) {
if (!require(package, character.only = TRUE)) {
install.packages(package)
}
}
# Load the packages
library(ggplot2)
library(reshape2)
library(hexbin)
library(tidyverse)
library(modelr)
library(MASS)
library(dplyr)
library(ggraph)
library(gridExtra)
library(glmnet)
To facilitate reproducible research, a seed is set for all processes that involve randomization.
set.seed(202375)
The data set was downloaded to the local machine (Windows 11, OS build 22621.2861) and uploaded into the server directory via the RStudio for Browser interface, since the data set is not the open-source version. During the analysis of the data set, it was recognized as a reduced data set constructed from the [King County House Sale Prices](https://github.com/viviandng/KC_House_Price).
The same data set was used in the Kaggle competition kc_house_data, where some interesting approaches to prediction and analysis can also be found.
The difference between the original and the provided data set lies in the number of columns/variables, since some of the columns in the original set were dropped.
To get some grasp of the data set we are handling, we can look at the structure of our import.
head(raw_data)
This tells us that we have 15 columns, mostly with integer or numeric values.
str(raw_data)
## 'data.frame': 21613 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living: int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
The data set has 21613 entries (observations).
# Create a new dataframe with average price for each lat-long pair
# avg_price_df <- raw_data %>% group_by(lat, long) %>% summarise(avg_price = mean(price))
data.frame(long = raw_data$long, lat = raw_data$lat) %>%
ggplot(aes(long, lat, fill = after_stat(count))) +
geom_hex(bins = 75) +
scale_fill_viridis_c() +
labs(
title = "geographical sales distribution",
x="Longitude",
y="Latitude"
)
data.frame(long = raw_data$long, lat = raw_data$lat, price = raw_data$price) %>%
ggplot(aes(long, lat, z = price)) +
stat_summary_hex(fun = mean, bins = 75) +
scale_fill_viridis_c() +
labs(
title = "geographical value distribution",
x="Longitude",
y="Latitude"
)
Here we can see some interesting details, which could ultimately lead to
deeper insights. For example, the sales in general seem to be well
distributed, with a region in the north-west where relatively more houses
were sold. Additionally, we can conclude that the region between Lake
Washington and Lake Sammamish correlates with higher prices for the
houses sold there.
Are we fine with the datatypes assigned?
Interestingly, the ‘bedrooms’ and ‘bathrooms’ columns do not have the
same datatype (int vs. num), so all the datatypes should be investigated
and, if deemed necessary, the variables transformed to a more fitting
type. Additionally, the decision was made to manipulate the data in a
new dataframe (called df), to keep the original data for development
purposes.
df<-raw_data
df$waterfront <- ifelse(df$waterfront == 0, FALSE, TRUE)
unique(df$waterfront)
## [1] FALSE TRUE
Transforming the zip code from integer to a categorical variable, since the codes have no numerical meaning.
df$zipcode <- as.factor(df$zipcode)
length(unique(df$zipcode))
## [1] 70
Transforming the view into a categorical variable, for similar reasons as above.
(unique(df$view))
## [1] 0 3 4 2 1
Investigating the numeric and integer variables: which values do they actually take?
print(sort(unique(df$floors)))
## [1] 1.0 1.5 2.0 2.5 3.0 3.5
class(df$floors)
## [1] "numeric"
print(sort(unique(df$bedrooms)))
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 33
class(df$bedrooms)
## [1] "integer"
print(sort(unique(df$view)))
## [1] 0 1 2 3 4
class(df$view)
## [1] "integer"
print(sort(unique(df$grade)))
## [1] 1 3 4 5 6 7 8 9 10 11 12 13
class(df$grade)
## [1] "integer"
print(sort(unique(df$condition)))
## [1] 1 2 3 4 5
class(df$condition)
## [1] "integer"
The datatypes of these variables may hint at their nature, but it remains somewhat unclear what a value of 2.5 represents in the ‘floors’ category.
But how does the waterfront influence the price?
p_wf_1 <- ggplot(df, aes(x = waterfront, y = price)) + geom_boxplot()
p_wf_2 <- ggplot(df, aes(x = waterfront, y = price)) + geom_violin()
grid.arrange(p_wf_1, p_wf_2, ncol=2)
Here a logistic regression is appropriate, since waterfront is not a continuous variable:
# Fit the logistic regression model
model_logistic <- glm(waterfront ~ price, family = binomial, data = df)
# Family objects provide a convenient way to specify the details of the models
# Create a new plotting window and set the plotting area into a 1*2 array
par(mfrow = c(1, 2))
# Plot the diagnostic plots and print the summary of the logistic model
plot(model_logistic)
summary(model_logistic)
##
## Call:
## glm(formula = waterfront ~ price, family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.501e+00 1.384e-01 -46.98 <2e-16 ***
## price 1.948e-06 9.019e-08 21.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1918.0 on 21612 degrees of freedom
## Residual deviance: 1488.1 on 21611 degrees of freedom
## AIC: 1492.1
##
## Number of Fisher Scoring iterations: 8
The plots can be interpreted as follows:
For the logistic regression model the Akaike Information Criterion (AIC) is 1492.1. This value in itself gives only relative information; it can later be used to compare the model against others. The information contained in it balances the goodness of fit against the complexity of the model.
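To make the fitted coefficients tangible, the predicted probability of a waterfront can be computed by hand; a minimal base-R sketch, with the coefficient values copied from the summary above:

```r
# Predicted probability of waterfront as a function of price, using the
# fitted coefficients reported above
b0 <- -6.501e+00   # intercept
b1 <- 1.948e-06    # price coefficient
p_waterfront <- function(price) plogis(b0 + b1 * price)

# the probability rises with price, but stays small for typical prices
p_waterfront(c(5e5, 1e6, 5e6))
```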
Without further knowledge about the origin and nature of the data set, we continue to step 2 of the exercise and visualize the distribution of the variables. Here we differentiate between the datatypes of the variables.
Interestingly, the category ‘floors’ is numeric and increases in half steps until it reaches its maximum at 3.5.
floor_counts <- table(df$floors)
# Convert the table into a data frame
floor_df <- as.data.frame(floor_counts)
names(floor_df) <- c("Floors", "Count")
# Create the bar plot
ggplot(floor_df, aes(x = Floors, y = Count)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(
title = "Distribution of the number Floors",
x = "Number of Floors",
y = "Count"
)
In this analysis we are not going to delve into the geographical aspects, and therefore exclude the trailing geographic column.
# keep only the first twelve columns (dropping the trailing 'long'-column)
df_num_subset <- df_num[1:12]
head(df_num_subset)
(displaying extreme values)
During the previous visualization step, some individual entries with
extreme values were detected. We want to identify them and remove them
from the data set, since they are not going to help explain the majority
of the data. Interesting values were detected in the price and
bedrooms categories; all others do not seem that extreme.
# show the entry with more than 30 bedrooms
dim(df_num) # dimensions of the dataframe, for validation purposes (note: a different dataframe, since we are going to overwrite the subset)
## [1] 21613 13
# return the entries, which we are going to remove
print(df_num_subset[(which(df_num_subset$bedrooms>30)),])
## X price bedrooms bathrooms sqft_living sqft_lot floors view
## 15871 15871 640000 33 1.75 1620 6000 1 0
## condition grade yr_built lat
## 15871 5 7 1947 47.6878
# remove the entries and overwrite the dataframe
df_num_subset <- df[df_num_subset$bedrooms <= 30, ]
dim(df_num_subset) # validate the removal by comparing the dimensions
## [1] 21612 15
# Store the name of the second column ('price'), used as the target later on
second_col <- colnames(df_num_subset)[2]
# Identify the boolean columns
bool_cols <- sapply(df_num_subset, is.logical)
# Drop the boolean columns
df_clean <- subset(df_num_subset, select = -which(bool_cols))
# Drop 'zipcode' from the dataframe, since we need numerical columns later on
df_clean_numerical <- subset(df_clean, select = -zipcode)
To identify highly correlated variables, the Pearson correlation coefficient can be calculated. This allows filtering for the variables that are highly correlated with price.
# Calculate a correlation matrix
corr_matrix <- cor(df_clean_numerical)
# Reshape the correlation matrix into a long format
melted_corr_matrix <- melt(corr_matrix)
# Create the heatmap
ggplot(data = melted_corr_matrix, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1)) +
coord_fixed()
To create an accurate model, we are going to model individual variables and compare their accuracy. As measures for comparison, the Mean Squared Error (MSE) and the Root Mean Squared Error (RMSE) will be evaluated.
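The two comparison metrics can be written as small base-R helpers (a sketch; the report computes them inline below):

```r
# Mean Squared Error and Root Mean Squared Error as reusable helpers
mse  <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

mse(c(1, 2, 3), c(1, 2, 5))   # squared errors 0, 0, 4 -> mean 4/3
rmse(c(1, 2, 3), c(1, 2, 5))
```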
# Get the correlations with price
corr_price <- corr_matrix[,"price"]
# Sort the correlations in descending order
sorted_corr_price <- sort(corr_price, decreasing = TRUE)
# Print the top 5 correlations (plus price itself)
head(sorted_corr_price, 6)
## price sqft_living grade bathrooms view bedrooms
## 1.0000000 0.7020466 0.6674473 0.5251471 0.3972989 0.3154449
So now we know that the highest (Pearson) correlations with price in our data set are with the variables sqft_living, grade, bathrooms, and view.
{section_plots}
# Loop over the columns
for (col in colnames(df_clean)[-2]) {
# Create the plot
p <- ggplot(df_clean, aes(.data[[col]], .data[[second_col]])) +
geom_hex() +
labs(title = paste("Scatter-(/Hex-)plot of", col, "vs", second_col), x = col, y = second_col)
# Print the plot
print(p)
}
# Use 70% of the data set as the training set and 30% as the test set
train_sample <- sample(c(TRUE, FALSE), nrow(df_clean), replace=TRUE, prob=c(0.7,0.3))
# Assign the training and test sets
train_set <- df_clean[train_sample, ]
test_set <- df_clean[!train_sample, ]
Next we create a linear model with the house size ‘sqft_living’ as a
single variable and test it against the test data set. The assumption in
this model is that the “square footage of house’s living space” can
explain the house price. The quality of the prediction will suffer from
this simplification, since all other variables will have at least some
influence on the house price.
(The summary(model_1) can be found, together with all the other models,
at the end of the report.)
# Fit the model using the training set
model_1 <- lm(price ~ sqft_living, data = train_set)
# Predict the target variable using the model and the training set
train_pred_1 <- predict(model_1, newdata = train_set)
# Predict the target variable using the model and the test set
test_pred_1 <- predict(model_1, newdata = test_set)
# Evaluate the performance of the model using the test set
MSE_m1 <- mean((test_set$price - test_pred_1)^2)
MSE_m1
## [1] 67031679108
# construct predictions
grid_1 <- df_clean %>%
# predict 20 evenly spaced values between the min and max of sqft_living
data_grid(sqft_living = seq_range(sqft_living, 20)) %>%
add_predictions(model_1, "price")
library(prodlim)
# Use the model to make predictions on the test set
predictions_1 <- model_1 %>% predict(test_set)
# Examine the RMSE of the predictions
RMSE_m1 = sqrt(mean((test_set$price - predictions_1)^2))
RMSE_m1
## [1] 258904.8
Plotting the model against the data; the red line shows the model’s predictions.
ggplot(df_clean, aes(sqft_living, price)) +
geom_hex(bins = 50) +
geom_line(data = grid_1, colour = "red", linewidth = 1)+
labs(title = "linear Model (sqft_living)")
df_clean <- df_clean %>%
add_residuals(model_1, "resid_1")
# explore residuals distribution
# versus sqft_living
# Plot using geom_hex
ggplot(df_clean, aes(sqft_living, resid_1)) +
geom_hex(bins = 50) +
geom_abline(slope = 0, color = "red", linewidth = 1)
Our initial model, which considers only sqft_living, models the prices of houses with small living areas acceptably, but gets progressively worse for the larger houses.
ggplot(df_clean, aes(sqft_living, price)) +
geom_hex(bins = 50) +
geom_line(data = grid_1, colour = "red", linewidth = 1) +
scale_y_log10() +
scale_x_log10() +
scale_fill_viridis_c() +
ggtitle("Model 2 - double logscale") +
xlab("log(sqft_living)")+
ylab("log(price)")
# Fit the model using the training set
model_2 <- lm(log2(price) ~ log2(sqft_living), data = train_set)
# Predict the target variable using the model and the training set
train_pred_2 <- predict(model_2, newdata = train_set)
# Predict the target variable using the model and the test set
test_pred_2 <- predict(model_2, newdata = test_set)
summary(model_2)
##
## Call:
## lm(formula = log2(price) ~ log2(sqft_living), data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.58430 -0.42179 0.01536 0.37105 1.90050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.654921 0.081015 119.2 <2e-16 ***
## log2(sqft_living) 0.841170 0.007425 113.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5598 on 15120 degrees of freedom
## Multiple R-squared: 0.4591, Adjusted R-squared: 0.4591
## F-statistic: 1.284e+04 on 1 and 15120 DF, p-value: < 2.2e-16
# Evaluate the performance of the model using the test set
MSE_m2 <- mean((test_set$price - test_pred_2)^2)
MSE_m2
## [1] 4.23912e+11
Unfortunately, model 2 seems to be worse than model 1, since it has a smaller R² (0.4591 vs. 0.4976). Note also that MSE_m2 as computed above is not directly comparable to the other MSEs: test_pred_2 is on the log2 scale and would have to be back-transformed (2^test_pred_2) before comparing against the raw prices. But since the R² metric is somewhat flawed, we could still investigate the model, without much hope that it is very good. (It is still fine for the little data we use for its creation.)
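A minimal sketch of that back-transformation, with hypothetical values rather than the project data: predictions made on the log2 scale must be mapped back with 2^x before errors on the price scale are meaningful.

```r
# hypothetical log2(price) predictions and actual prices
log2_pred  <- c(18, 19, 20)
price_pred <- 2^log2_pred          # back to the original price scale
actual     <- c(250000, 530000, 1100000)
mean((actual - price_pred)^2)      # MSE on the price scale
```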
# construct predictions in logscale
grid_2 <- df_clean %>%
data_grid(sqft_living = seq_range(sqft_living, 20)) %>%
add_predictions(model_2, "lprice") %>%
mutate(price = 2 ^ lprice)
df_clean <- df_clean %>%
add_residuals(model_2, "resid_2")
# Plotting residuals
ggplot(df_clean, aes(sqft_living, resid_2)) +
geom_hex(bins = 50) +
geom_abline(slope = 0, color = "red", linewidth = 1) +
ggtitle("Model 2 - residuals")
# Plotting Model2 in lin
ggplot(df_clean, aes(sqft_living, price)) +
geom_hex(bins = 50) +
geom_line(data = grid_2, colour = "red", linewidth = 1) +
ggtitle("Model 2 - plot in linear scale")
# Plotting Model2 in double log
ggplot(df_clean, aes(sqft_living, price)) +
geom_bin2d(bins = 50) +
geom_line(data = grid_2, colour = "red", linewidth = 1) +
scale_y_log10() +
scale_x_log10() +
annotation_logticks() +
ggtitle("Model 2 - double_log")
Since our model only incorporates one variable (sqft_living), we would like to add more information to increase the precision/accuracy of the model. Which of our variables could help predict the price of houses? The grading of the house, or the number of bedrooms, could help explain our target variable.
ggplot(df_clean, aes(x = grade, y = resid_2, fill = as.factor(grade))) +
geom_boxplot() +
geom_abline(slope = 0, color = "red", linewidth = 1)
ggplot(df_clean, aes(x = bedrooms, y = resid_2, fill = as.factor(bedrooms))) +
geom_boxplot() +
geom_abline(slope = 0, color = "red", linewidth = 1)
# Fit the model using the training set
model_3 <- lm(price ~ grade - 1, data = train_set)
# Predict the target variable using the model and the training set
train_pred_3 <- predict(model_3, newdata = train_set)
# Predict the target variable using the model and the test set
test_pred_3 <- predict(model_3, newdata = test_set)
summary(model_3)
##
## Call:
## lm(formula = price ~ grade - 1, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -448806 -206310 -92406 55795 6743191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## grade 73601 336 219 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 320100 on 15121 degrees of freedom
## Multiple R-squared: 0.7603, Adjusted R-squared: 0.7603
## F-statistic: 4.797e+04 on 1 and 15121 DF, p-value: < 2.2e-16
# Evaluate the performance of the model using the test set
MSE_m3 <- mean((test_set$price - test_pred_3)^2)
MSE_m3
## [1] 9.5647e+10
# construct predictions
grid_3 <- df_clean %>%
data_grid(grade) %>%
add_predictions(model_3, "price")
# plot the data + predictions
ggplot(df_clean, aes(grade, y = price, fill = as.factor(grade))) +
geom_boxplot() +
geom_point(data = grid_3, colour = "red", size = 1)
In our third model we systematically underestimate highly (and the lowest) graded houses, but overestimate some houses in the middle of the grade range. This trend could also have been predicted from the residuals of model 2 against the grades. So overall it is highly likely that using a non-linear model will result in much better predictions.
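One way to probe the suspected non-linearity is to compare a linear against a quadratic fit. The sketch below uses simulated data with a convex grade-price relationship (an assumption for illustration only, not the project data):

```r
set.seed(202375)
grade <- sample(3:13, 500, replace = TRUE)
# simulate a convex relationship between grade and price
price <- 20000 * grade^2 + rnorm(500, sd = 50000)
fit_lin  <- lm(price ~ grade)
fit_quad <- lm(price ~ poly(grade, 2))
AIC(fit_lin, fit_quad)   # the quadratic fit should achieve the lower AIC
```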
# Fit the model using the training set
model_4 <- lm(price ~ sqft_living + grade + bathrooms + view + bedrooms, data = train_set)
# Predict the target variable using the model and the training set
train_pred_4 <- predict(model_4, newdata = train_set)
# Predict the target variable using the model and the test set
test_pred_4 <- predict(model_4, newdata = test_set)
summary(model_4)
##
## Call:
## lm(formula = price ~ sqft_living + grade + bathrooms + view +
## bedrooms, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1287778 -126500 -20670 95855 4561187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.347e+05 1.735e+04 -25.06 < 2e-16 ***
## sqft_living 2.080e+02 4.225e+00 49.23 < 2e-16 ***
## grade 8.851e+04 2.673e+03 33.11 < 2e-16 ***
## bathrooms -1.769e+04 4.020e+03 -4.40 1.09e-05 ***
## view 9.140e+04 2.667e+03 34.27 < 2e-16 ***
## bedrooms -3.596e+04 2.738e+03 -13.13 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240000 on 15116 degrees of freedom
## Multiple R-squared: 0.5804, Adjusted R-squared: 0.5803
## F-statistic: 4183 on 5 and 15116 DF, p-value: < 2.2e-16
# Evaluate the performance of the model using the test set
MSE_m4 <- mean((test_set$price - test_pred_4)^2)
MSE_m4
## [1] 55114573361
df_clean <- df_clean %>%
add_residuals(model_4, "resid_4")
# extract the residuals of model 4
resid_4 = model_4$residuals
ggplot(df_clean, aes(x = price, y = resid_4)) + #, fill = as.factor(grade)
geom_hex(bins = 50) +
geom_abline(slope = 0, color = "red", linewidth = 1) +
ggtitle("Model 4 residuals")
Next, plot the training and the test data simultaneously.
# Combine actual and predicted values for the training set
train_plot_data <- data.frame(Actual = train_set$price, Predicted = train_pred_4)
# Combine actual and predicted values for the test set
test_plot_data <- data.frame(Actual = test_set$price, Predicted = test_pred_4)
# Combine actual and predicted values for the training set and the test set
plot_data <- rbind(
data.frame(Actual = train_set$price, Predicted = train_pred_4, Source = "Training"),
data.frame(Actual = test_set$price, Predicted = test_pred_4, Source = "Test")
)
# Create a scatter plot for both the training set and the test set
# this time with colorblind-friendly colours
ggplot(plot_data, aes(x = Actual, y = Predicted, color = Source)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linetype = "dashed") +
scale_color_brewer(type="qual", palette = 6) + # use the Qualitative palettes, and from them the 6th combination (eq. to "Set1")
labs(
x = "Actual Price",
y = "Predicted Price",
title = "Actual vs. Predicted Price - Model4"
) +
theme_minimal()
Another way to plot the residuals is a QQ-plot: if the points follow the 45° line, the residuals are approximately normally distributed. (This allows for visual interpretation only.)
qqnorm(df_clean$resid_4)
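A reference line makes the visual comparison easier; base R provides qqline() for this. A sketch on simulated residuals (illustration only, not the model residuals):

```r
set.seed(1)
resid_example <- rnorm(500, sd = 2)   # simulated, approximately normal residuals
qqnorm(resid_example)
qqline(resid_example, col = "red")    # the points should hug this line
```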
summary(model_1)
##
## Call:
## lm(formula = price ~ sqft_living, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1515218 -147454 -24095 107165 4328253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -52961.324 5286.198 -10.02 <2e-16 ***
## sqft_living 284.208 2.322 122.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 262600 on 15120 degrees of freedom
## Multiple R-squared: 0.4976, Adjusted R-squared: 0.4976
## F-statistic: 1.498e+04 on 1 and 15120 DF, p-value: < 2.2e-16
summary(model_2)
##
## Call:
## lm(formula = log2(price) ~ log2(sqft_living), data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.58430 -0.42179 0.01536 0.37105 1.90050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.654921 0.081015 119.2 <2e-16 ***
## log2(sqft_living) 0.841170 0.007425 113.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5598 on 15120 degrees of freedom
## Multiple R-squared: 0.4591, Adjusted R-squared: 0.4591
## F-statistic: 1.284e+04 on 1 and 15120 DF, p-value: < 2.2e-16
summary(model_3)
##
## Call:
## lm(formula = price ~ grade - 1, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -448806 -206310 -92406 55795 6743191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## grade 73601 336 219 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 320100 on 15121 degrees of freedom
## Multiple R-squared: 0.7603, Adjusted R-squared: 0.7603
## F-statistic: 4.797e+04 on 1 and 15121 DF, p-value: < 2.2e-16
summary(model_4)
##
## Call:
## lm(formula = price ~ sqft_living + grade + bathrooms + view +
## bedrooms, data = train_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1287778 -126500 -20670 95855 4561187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.347e+05 1.735e+04 -25.06 < 2e-16 ***
## sqft_living 2.080e+02 4.225e+00 49.23 < 2e-16 ***
## grade 8.851e+04 2.673e+03 33.11 < 2e-16 ***
## bathrooms -1.769e+04 4.020e+03 -4.40 1.09e-05 ***
## view 9.140e+04 2.667e+03 34.27 < 2e-16 ***
## bedrooms -3.596e+04 2.738e+03 -13.13 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240000 on 15116 degrees of freedom
## Multiple R-squared: 0.5804, Adjusted R-squared: 0.5803
## F-statistic: 4183 on 5 and 15116 DF, p-value: < 2.2e-16
# Store the MSEs of the models in a vector
MSEs <- c(MSE_m1, MSE_m2, MSE_m3, MSE_m4)
# Create a bar plot
ggplot(data.frame(
Models = factor(c("lin_sqft", "sqft log-log-scale", "lin_grade", "multi_lin"),
levels = c("lin_sqft", "sqft log-log-scale", "lin_grade", "multi_lin")),
MSEs = MSEs),
aes(x = Models, y = MSEs)) +
geom_bar(stat = "identity") +
labs(
x = "Models",
y = "Mean Squared Error",
title = "MSE of Different Models"
) +
theme_minimal()
# Create a new plotting window and set the plotting area into a 2*2 array
par(mfrow = c(2, 2))
# Plot the diagnostic plots of each model
plot(model_1)
mtext("Statistics Model 1", side = 3, line = -2, outer = TRUE, font=4)
plot(model_2)
mtext("Statistics Model 2", side = 3, line = -2, outer = TRUE, font=4)
plot(model_3)
mtext("Statistics Model 3", side = 3, line = -2, outer = TRUE, font=4)
plot(model_4)
mtext("Statistics Model 4", side = 3, line = -2, outer = TRUE, font=4)
library(leaps)
# transform the 'sqft_living' into numeric to work with the subset selection
train_set$sqft_living <- as.numeric(train_set$sqft_living)
test_set$sqft_living <- as.numeric(test_set$sqft_living)
# best subset selection
bestSubsets <- regsubsets(price ~ sqft_living + grade + bathrooms + view + bedrooms, data = train_set)
str(test_set)
## 'data.frame': 6490 obs. of 14 variables:
## $ X : int 2 3 4 7 12 13 14 15 19 21 ...
## $ price : num 538000 180000 604000 257500 468000 ...
## $ bedrooms : int 3 2 4 3 2 3 3 5 2 4 ...
## $ bathrooms : num 2.25 1 3 2.25 1 1 1.75 2 1 1.75 ...
## $ sqft_living: num 2570 770 1960 1715 1160 ...
## $ sqft_lot : int 7242 10000 5000 6819 6000 19901 9680 4850 9850 4980 ...
## $ floors : num 2 1 1 2 1 1.5 1 1.5 1 1 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 5 3 4 4 4 3 4 4 ...
## $ grade : int 7 6 7 7 7 7 7 7 7 7 ...
## $ yr_built : int 1951 1933 1965 1995 1942 1927 1977 1900 1921 1947 ...
## $ zipcode : Factor w/ 70 levels "98001","98002",..: 56 17 59 3 50 17 38 46 2 58 ...
## $ lat : num 47.7 47.7 47.5 47.3 47.7 ...
## $ long : num -122 -122 -122 -122 -122 ...
reg.summary<-summary(bestSubsets)
reg.summary
## Subset selection object
## Call: regsubsets.formula(price ~ sqft_living + grade + bathrooms +
## view + bedrooms, data = train_set)
## 5 Variables (and intercept)
## Forced in Forced out
## sqft_living FALSE FALSE
## grade FALSE FALSE
## bathrooms FALSE FALSE
## view FALSE FALSE
## bedrooms FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## sqft_living grade bathrooms view bedrooms
## 1 ( 1 ) "*" " " " " " " " "
## 2 ( 1 ) "*" " " " " "*" " "
## 3 ( 1 ) "*" "*" " " "*" " "
## 4 ( 1 ) "*" "*" " " "*" "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
# output: in every row you see a model with the highest R squared
# for the given number of regressors,
# and which variables it includes
# extract the adj. R² of each model
reg.summary$adjr2
## [1] 0.4975829 0.5382433 0.5741356 0.5797987 0.5803085
This means the best subsets add the variables in the following order: sqft_living, view, grade, bedrooms, bathrooms.
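The choice of model size can also be made programmatically from the adjusted R² vector; a sketch with the values copied from the output above:

```r
# adjusted R² per model size, as reported by summary(bestSubsets) above
adjr2 <- c(0.4975829, 0.5382433, 0.5741356, 0.5797987, 0.5803085)
which.max(adjr2)   # the full five-variable model maximizes adjusted R²
```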
# plot Mallows’ cp
plot(reg.summary$cp, xlab = "Number of variables", ylab = "C_p",type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)],
col = "red", cex = 2, pch = 20)
coef(bestSubsets, which.min(reg.summary$cp))
## (Intercept) sqft_living grade bathrooms view bedrooms
## -434708.6646 207.9644 88512.1370 -17691.0212 91397.1169 -35961.5719
# BIC
plot(reg.summary$bic, xlab = "Number of variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)],
col = "red", cex = 2, pch = 20)
coef(bestSubsets, which.min(reg.summary$bic))
## (Intercept) sqft_living grade bathrooms view bedrooms
## -434708.6646 207.9644 88512.1370 -17691.0212 91397.1169 -35961.5719
bestSubsets.fwd <- regsubsets(price ~ sqft_living + grade + bathrooms + view + bedrooms, data = train_set, method = "forward")
reg.summary.fwd <- summary(bestSubsets.fwd)
# which model is the best according to Cp (forwards)
coef(bestSubsets.fwd, which.min(reg.summary.fwd$cp))
## (Intercept) sqft_living grade bathrooms view bedrooms
## -434708.6646 207.9644 88512.1370 -17691.0212 91397.1169 -35961.5719
# which model is the best according to BIC (forwards)?
coef(bestSubsets.fwd, which.min(reg.summary.fwd$bic))
## (Intercept) sqft_living grade bathrooms view bedrooms
## -434708.6646 207.9644 88512.1370 -17691.0212 91397.1169 -35961.5719
# Construct the best model according to BIC
lm_best_fwd <- lm(price ~ sqft_living + grade + bathrooms + view + bedrooms, data = train_set)
# Make predictions on the test set with predict()
predict_best_fwd <- predict(lm_best_fwd, test_set)
# Calculate the MSE
mse_fwd <- mean((test_set$price - predict_best_fwd)^2)
mse_fwd
## [1] 55114573361
The data itself seems to be of high quality. There are no missing values, and the data set has about 22 thousand entries in 13 categories. Additionally, the entries themselves seem correct, with no erroneous values, which in itself is a good thing and helps build trust in the analysis.
As explored in the data exploration section, we could build a model that considers the “waterfront” variable and interpret the data as a combination of two distinct populations, one with and one without waterfront, since this influences the house price.
Modelling: Based on the adjusted R²-statistics, the most accurate
price model is based on grade alone (0.7603), instead of the multiple
linear model. (Note, however, that model 3 was fitted without an
intercept (price ~ grade - 1), and the R² of a no-intercept model is
computed differently, so it is not directly comparable.) The reasons for
this could be manifold; some of them could be:
Overfitting: If only one of the predictors is responsible for the
outcome, and all other variables add noise and no additional
information, then a simple linear model will perform better than a
multiple linear model.
Non-linear relationships: This is likely the case in our
models. The visualization in the data exploration @section_plots shows
that there are some likely candidates, for example:
bedrooms: Gaussian (normal) distribution
sqft_lot: could be a combination of two Poisson distributions (one for ‘small’ lots/gardens, and one for ‘large’ lots/gardens).
Conclusion:
In general, the good quality of the data allows for various transformations and interpretation possibilities. Here, some simple models were built on individual or multiple variables, with the best-performing model being a linear model based on the grade of the house.